The present paper outlines two alternative strategies 
for evaluating teaching effectiveness. These are: (1) within-subject 
reversal designs, and (2) multiple baseline testing procedures. Each 
design is discussed in terms of its application to research problems 
in higher education. In reversal designs, the student is exposed to 
different teaching procedures during successive phases of a course. 
Changes in performance between treatments are analyzed, either on the 
basis of group averages or in terms of individual performances. The 
reversal design becomes even more powerful if a second group of 
students goes through the treatments in opposite order, thus 
counterbalancing the two groups for possible changes in difficulty 
across conditions. In the multiple baseline testing procedure, 
students are given a comprehensive examination before instruction 
begins, and at the end of each successive phase of the course. This 
allows the instructor to demonstrate that changes in performance are 
functionally related to specific teaching procedures introduced 
during each phase. Furthermore, it provides a continuous baseline 
measure over material that has not been trained, and a retention 
measure over material that has been trained. Percentage gains over 
baseline levels can be used to measure differential effects of 
different teaching procedures. Similar to the reversal design, the 
multiple baseline design allows the researcher to make statements 
about the effects of each procedure on individual students. 
(Author/DB) 
ABSTRACT

Researchers in higher education have traditionally used
control-group/experimental-group techniques to compare different teaching
procedures. Such experimental designs, however, frequently fail to
account for individual differences in student performance and student
preference. The present paper outlines two alternative strategies
for evaluating teaching effectiveness. These include: a) within-subject
reversal designs, and b) multiple baseline testing procedures.
Each design will be discussed in terms of its application to research
problems in higher education.

In reversal designs, the student is exposed to different teaching
procedures during successive phases of a course. Changes in performance
between treatments are analyzed either on the basis of group averages,
or in terms of individual performances. The reversal design becomes
even more powerful if a second group of students goes through the
treatments in opposite order, thus counterbalancing the two groups
for possible changes in difficulty across conditions.

In the multiple baseline testing procedure, students are given a
comprehensive examination before instruction begins, and at the end of
each successive phase of the course. This allows the instructor to
demonstrate that changes in performance are functionally related to
specific teaching procedures introduced during each phase. Furthermore,
it provides a continuous baseline measure over material which has not
been trained, and a retention measure over material that has been
trained. Percentage gains over baseline levels can be used to measure
differential effects of different teaching procedures. Similar to the
reversal design, the multiple baseline design allows the researcher to
make statements about the effects of each procedure on individual
students.

In 1968, Dubin and Taveggia (1968) published a comprehensive,
comparative analysis of several different teaching methods - lecture,
lecture-discussion, tutorial and independent self-study. Rather than
accept authors' conclusions, they reanalyzed the data from some 91
studies conducted between 1925 and 1965. The results? There were no
measurable differences among truly distinctive methods of instruction
when evaluated by student performance on midterm and final examinations.
Surprising? Discouraging? Expected? It all depends.

Perhaps midterm and final examinations do not, as some have argued,
measure what was really taught, which places their existence in question.
Perhaps midterms and finals are characterized by too much variability to
discriminate even large effects. Perhaps, as Hilgard (in Dubin and
Taveggia, 1968) has noted, the way in which course materials are
programmed, usually through a textbook from which test items are
derived, is so powerful that it overrides differences in teaching.
Perhaps, too, the problem is the way and rigor with which teaching
methods have been analyzed and evaluated. That is, the inability to
find differences in the effectiveness of different teaching technologies
may lie in the experimental strategies used to assess those techniques.
Most likely, there is no single cause, but rather a multiple-cause
combination of all of the above. The purpose of this presentation is
to outline some alternative strategies for research in college teaching
which place greater emphasis on individual student performance and
preference, and which have shown that some variables do make a difference.

Static Group Designs

Most college teaching research has used static group experimental
designs to compare different teaching methods. Typically, one group of
students is exposed to teaching method A and a second group to teaching
method B. The two groups (or more) are compared on the basis of
performance on midterm exams, a final and/or student course ratings.
Differences between groups are typically assessed using any of a variety
of statistical tests to determine if they are significant at some
predetermined probability level. Significant is an unfortunate word
because it is sometimes used to imply important. For example, the only
statistically significant result reported by Dubin and Taveggia (1968)
was that students who studied course material performed better than a
group of control students who did not take the course. Although this
is an appropriate control procedure, it is hardly an important source
of information for an analysis of effective college teaching procedures.

This is not meant to deny the usefulness of statistics as a tool,
but rather to suggest that researchers need not be slaves to them. Group
statistics are most useful when we ask actuarial questions of the form -
what percentage of students prefer X, Y or Z, or what percentage of
students withdraw from a given course. They are also valuable when we
attempt to standardize test materials by determining what proportion
of students answer an item or items correctly following a standard
method of instruction. However, when we use static group designs to
compare student performances or preferences, we may indeed find that
differences among students far outweigh the effects generated by
different teaching procedures.

Single Subject Analysis

In the present paper, we will describe and illustrate the use of
two major within-subject or within-group experimental designs, ones
which I think will enable us to achieve a more precise analysis of
college teaching techniques. Single subject analysis typically involves
monitoring a student behavior frequently throughout the term, say
ten to thirty times, in comparison with traditional designs which
assess student performance and/or preference infrequently. We have
Keller's (1968) system of personalized instruction to thank for small
units of material which enable researchers to measure student behavior
at frequent intervals.

In recent years, personalized systems of instruction have received
a great deal of popular and empirical support. Contrary to the history
of college teaching research portrayed by Dubin and Taveggia (1968),
several investigators (Born, Gledhill and Davis, 1972; McMichael and
Corey, 1969; Sheppard and MacDermot, 1970) have demonstrated that
personalized instruction produces significantly better examination
performance and higher student ratings than more traditional,
lecture-discussion methods. It is not my purpose here to advocate or criticize
personalized instruction, but rather to borrow from it some features
which may help us better evaluate our teaching methods.
Small Units of Material

One salient feature of personalized and programmed instruction is
the division of course materials into several small parts or units. In
our introductory child development course at the University of Kansas
(which we teach by personalized instruction), we have sixteen unit
quizzes and a final examination which is analyzed in 16 corresponding
parts. In addition to enabling a more precise analysis of student
behavior, the use of short units also improves test performance
(Semb, 1973). When students were required to study material for four
units combined and then tested on an achievement test over the material,
they performed worse than when they studied and were tested
over four units separately. Thus, there is a data base as well as an
experimental rationale for using relatively short units of material.
Within-Subject Designs - Rationale

The use of within-subject and within-group experimental designs
in college teaching research is relatively new. Campbell and Stanley
(1963) refer to these designs as quasi-experimental time series analyses.
They have been used widely in operant (behavior analysis) research with
both animals and children. The two major designs we will consider in
the present paper are the ABA reversal design and the multiple baseline
design. In both, the same subject or group of subjects is exposed to
two or more treatment conditions in successive order. Thus, one major
source of variability characteristic of static group designs,
between-subject variability, is eliminated. Another obvious advantage of single
subject analysis is the emphasis placed on the individual student.
Reversal Designs

In the typical ABA reversal design, the behavior of interest
(e.g., student performance or preference) is measured frequently over
time to establish a baseline as a basis for forecasting what the level
of the behavior will be in the future (Risley and Wolf, 1972). Thus,
as Fig. 1 shows, one can analyze the trend of the data, a feature
which is just as important as the absolute magnitude of the effect.
For example, Fig. 2 shows the same magnitude of effect as Fig. 1,
but the trend is obviously different. Given the increasing level
of the behavior in Condition A (Baseline), it would be difficult to
argue that the increase observed during B would not have occurred had
the experimental condition not been introduced.

Insert Fig. 1 about here

Insert Fig. 2 about here
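
The forecasting argument can be made concrete with a short sketch. It is only an illustration: the least-squares line, the function names, and all of the scores below are assumptions introduced here, not data or procedures from Risley and Wolf (1972). The idea is to extend the baseline trend into the treatment phase and ask how far the observed scores exceed that forecast.

```python
from statistics import mean

def fit_line(ys):
    """Least-squares line through (0, ys[0]), (1, ys[1]), ...

    Returns (slope, intercept). Using a straight line as the baseline
    forecast is an assumption made for this sketch.
    """
    xs = range(len(ys))
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

def excess_over_forecast(baseline, treatment):
    """Mean of (observed - forecast) over the treatment sessions,
    where the forecast extends the baseline trend line."""
    slope, intercept = fit_line(baseline)
    start = len(baseline)
    forecasts = [intercept + slope * (start + i) for i in range(len(treatment))]
    return mean(obs - f for obs, f in zip(treatment, forecasts))

# Hypothetical percent-correct scores; both baselines have a mean of 50.
flat_baseline = [50, 50, 50, 50]
rising_baseline = [44, 48, 52, 56]
treatment = [60, 64, 68, 72]

print(excess_over_forecast(flat_baseline, treatment))    # 16.0
print(excess_over_forecast(rising_baseline, treatment))  # 0.0
```

With the flat baseline, the treatment scores sit well above the forecast. With the rising baseline, the same treatment scores are exactly what the trend predicted, even though the difference between phase means is identical in the two cases.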

To continue our discussion of the reversal design, consider the
data presented in Fig. 3. In this experiment (Semb, Hopkins and Hursh,
1973), the instructor measured student test performance on two types of
items - study question items and non-study (probe) items. During the
baseline condition, students earned points for correct answers to quiz
questions which contributed to their grade in the course. Notice the
advantage of short units - it enabled the instructor to obtain four
data points before a second condition (Noncontingent Points) was
introduced. During the Noncontingent Point condition, students were
given the maximum number of quiz points before they took the quiz.
This was done to determine whether or not points awarded for correct
responses promoted better performance. As Fig. 3 shows, there was a
substantial decrease in mean student performance on both dependent
variables. We could have stopped the experiment at this point, and
we might have been relatively convinced that "free grades" lead to
poorer test performance. However, in any time series analysis, it could
have been due to a variety of other factors. For example, it could be
that during the noncontingent point condition the material in Units 8-9
or Quizzes 8-9 was more difficult, or that students did not like the
material, or that they were getting tired of the course. To increase
our confidence that there was a functional relationship between points
awarded for correct responses and performance, the original condition
(Baseline) was reinstated, thus completing the ABA sequence. With
repeated reversals we would add even more to the certainty that our
manipulation actually produced or caused the observed changes in
performance.

Insert Fig. 3 about here

Notice, however, that the ABA reversal design, when applied to
performance variables, suffers from a potential source of confounding.
Because content and test items in each successive condition are
different, it could be that changes in performance are due to changes
in the difficulty of the content and/or test items. One way to control
for this possible source of confounding is to "standardize" test items
on another population of students who are exposed to the same treatment
throughout the term (Semb, 1972; Semb, Hopkins and Hursh, 1973), or to
use a panel of experts (e.g., graduate students). With a standardized
pool of items, it is possible to select test questions with the same
mean and range from condition to condition or unit to unit. Believe
me, this is no easy task, and it involves an additional assumption -
namely, that the difficulty of those items remains constant from one
term to the next. Furthermore, caution must be taken to prevent items
from becoming part of student test files.

An alternative way to control for changes in item or content
difficulty is to expose a second group of students to the treatments
in opposite order from the first group. Bostow and Blumenfeld (1972)
have used this design (see Fig. 4), and we make frequent use of it in
our present experiments. We call it a counterbalanced reversal design.
Both within and between group comparisons are possible. Because the
two groups or subjects are in opposite treatments over the same
content (counterbalanced), within group comparisons between successive
treatments enable one to state that changes in performance or preference
are a function of the treatment condition and not the content covered
or test items used.

Insert Fig. 4 about here
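
The arithmetic behind counterbalancing can be sketched as follows. This is an illustration under a simplifying assumption that is not stated in the studies cited: that the treatment effect and the change in content difficulty simply add. The function name and all scores are hypothetical.

```python
def counterbalanced_estimates(g1_part1_a, g1_part2_b, g2_part1_b, g2_part2_a):
    """Separate a treatment effect from a content-difficulty shift.

    Group 1 receives treatment A on part 1 and B on part 2.
    Group 2 receives treatment B on part 1 and A on part 2.
    Assumes (for this sketch only) that effects are additive.
    """
    change_g1 = g1_part2_b - g1_part1_a  # = (B - A) + difficulty shift
    change_g2 = g2_part2_a - g2_part1_b  # = -(B - A) + difficulty shift
    treatment_effect = (change_g1 - change_g2) / 2
    difficulty_shift = (change_g1 + change_g2) / 2
    return treatment_effect, difficulty_shift

# Hypothetical mean percent-correct scores: method B adds 10 points,
# and the part 2 content is 5 points harder than the part 1 content.
effect, shift = counterbalanced_estimates(
    g1_part1_a=70, g1_part2_b=75,
    g2_part1_b=80, g2_part2_a=65,
)
print(effect, shift)  # 10.0 -5.0
```

Group 1 alone would suggest that B is worth only 5 points, because the harder content in part 2 masks half of the effect. Comparing the two within-group changes removes the content term, which is the point of running the treatments in opposite order.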

Another application of the reversal design is presented in Fig. 5.
Miller, Weaver and Semb (1973) assigned target dates, backed up by a
course withdrawal contingency (pass the quiz by the target date or
withdraw from the course) during the initial condition. Students
averaged a little over one quiz completed per day. During the next
condition, no target dates or contingencies were imposed on performance,
and students' rates of lesson completion decreased considerably. To
establish the functional relationship between target dates backed up
by the course withdrawal contingency and student rates of lesson
completion, the initial condition was reinstated during the third
phase of the course.

Insert Fig. 5 about here

The reversal design is also nicely suited for measuring student
preferences toward different teaching methods. It seems reasonable
to suppose that if we want students to rate one method over another
that we expose them to both. This is certainly a major advantage of
single subject experimental analysis, because in such an analysis the
student is exposed at least once to both methods of instruction. If
you are not satisfied with students' verbal reports such as course
rating scales or questionnaires, there is yet another alternative with
the reversal design. Everything remains the same as before in the first
two (AB) treatments. However, during the final condition, the student
is given a choice between A and B. This is an extremely strong measure
of student preference in that regardless of what the student says, he
or she must commit to one method or the other. Here we encounter
the actuarial advantages of a large group of students - what proportion
prefer each method? For an example of this research design, see Hursh,
Wildgen, Minkin, Minkin, Sherman and Wolf (1973).

The reversal design is a powerful technique for analyzing individual
student behavior. However, it has two major drawbacks. First, some
behavior changes are not reversible. For example, once you have
successfully taught a particular concept, there is no way to reverse
the student's history. No number of reversals will produce the original
level of the behavior. Second, in some cases, a reversal may be
politically or ethically unwise. Alternatively, a manipulation may
be so aversive that a return to the original condition occurs almost
immediately. For example, two years ago (Semb, 1972) I wanted to
increase attendance at optional discussion classes following quiz
days. Initially, we returned quizzes to students via a distribution
center which was open all day. See Fig. 6 labeled baseline. To
promote better attendance, we decided to return quizzes for the 1:30
lecture group only during discussion sections. As Fig. 6 shows,
attendance increased. What it does not show is the wrath we encountered -
mad, threatening students and an onslaught of phone calls. We had
planned to implement the same procedure with the second group later
in the semester, but scrapped the idea and returned to the initial
condition for the 1:30 group. Sometimes, the situation is just the
opposite. One starts with something bad, goes to something good, and
then, for whatever reason, does not want to return to the bad.
Fortunately, there is a design, the multiple baseline design, which
can be used to establish causality over time without the necessity
of a reversal.

Insert Fig. 6 about here

Multiple Baseline Design

In a multiple baseline design, two or more behaviors of the same
subject or group of subjects are measured at the same time. After
initial baseline measures are obtained on both, the experimental
manipulation is applied to only one behavior. The change obtained is
compared with the level forecast for that behavior from its baseline.
The confidence in our forecast is directly related to the level of the
baseline of the second and subsequent behaviors being measured.

Consider Fig. 7 (Miller and Weaver, 1972). Several times during
the course (shown on the abscissa), Miller and Weaver administered the
same, comprehensive, fill-in-the-blank achievement test. During the
initial testing, students scored low on all five sections of the test
(shown on separate graphs on the ordinate). As the course progressed,
increases in performance were directly related to that part of the
course which had just previously been trained. Notice, too, that
because the test was comprehensive across the entire course, it allowed
the researchers to obtain retention measures on material which had been
presented early in the course.

Insert Fig. 7 about here

The multiple baseline design is somewhat weaker than the reversal
design in that it involves an additional assumption - namely, that all
measured behaviors are susceptible to the same manipulations. This
assumption can be supported, however, by applying the same procedure
to many successive behaviors (Risley and Wolf, 1972). The comparison
of interest in each is between baseline (pre-training) and treatment
(post-training), not between sets of behaviors. The remaining behaviors
serve to support the baseline forecast of the first behavior. Notice
in Fig. 7 that the multiple baseline design allows the researcher to
establish trend as well as mean effects. Performance on Conditioned
Reinforcement material during baseline (pretraining) increased consistently,
perhaps because the similarity between conditioned reinforcement and
concepts presented earlier in the course was high.

There are several types of multiple baseline designs. The two
most useful variations in college teaching research include the one
just described in which performance on all phases of course content
is measured at frequent intervals with each part serving as a separate
baseline. The second involves using two or more subjects or groups of
subjects. The experimental treatment is introduced successively for
each group at different times. For example, assume that each of the
five coordinates shown in Fig. 7 represented a different group of
students. At different points during the course, the experimental
treatment is introduced first for the top group, then for the next
group, and so on. Changes in behavior associated with each
introduction of the independent variable increase our confidence that the
manipulation produced the effect.
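
A staggered introduction of this kind can be summarized with a few lines of code. The sketch below is illustrative only: the group names, session scores, start points, and the tolerance used to judge a "stable" baseline are all assumptions introduced here, not values from Miller and Weaver (1972).

```python
from statistics import mean

def change_at_introduction(series, start):
    """Mean score after the treatment starts minus mean score before."""
    return mean(series[start:]) - mean(series[:start])

def staggered_summary(scores, starts):
    """Change at each group's own introduction point."""
    return {g: change_at_introduction(scores[g], starts[g]) for g in scores}

def baselines_stable(scores, starts, tolerance):
    """True if every pre-treatment score stays within `tolerance` points
    of that group's baseline mean. The tolerance rule is an assumption
    made for this sketch, not a published criterion."""
    for group, series in scores.items():
        pre = series[:starts[group]]
        centre = mean(pre)
        if any(abs(y - centre) > tolerance for y in pre):
            return False
    return True

# Hypothetical percent-correct scores over six testing sessions.
scores = {
    "group 1": [40, 70, 72, 71, 73, 74],
    "group 2": [42, 41, 43, 75, 74, 76],
    "group 3": [38, 40, 39, 41, 42, 78],
}
# Index of the first session under treatment for each group.
starts = {"group 1": 1, "group 2": 3, "group 3": 5}

print(staggered_summary(scores, starts))
print(baselines_stable(scores, starts, tolerance=3))
```

In these made-up data each group improves only when its own treatment begins, while the groups still in baseline stay near their starting level. That pattern, repeated at each introduction, is what supports the claim that the manipulation produced the change.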

The Miller and Weaver (1972) multiple baseline achievement test
can also be extended to compare the effectiveness of different teaching
procedures. For example, in a recent experiment (Semb, 1973), I used
the same type of comprehensive test administered five times during the
semester. The data for two groups of students who went through the
experimental treatments in different order are shown in Fig. 8. Although
each treatment produced increases in performance on both study question
items and non-study (probe) items, the magnitude of effect was not
the same for each treatment. Furthermore, the graph is difficult to
read, so we computed percentage gains in performance by subtracting
pre-test scores from training scores. These data are presented in
Fig. 9. Notice that the increases in performance over pre-test levels
were greatest for the high grading criterion - short assignment condition
on both types of items. Thus, it was possible, using the multiple
baseline testing procedure, to demonstrate the functional relationship
between each teaching package and performance, and furthermore, to
compare different teaching methods as to their effectiveness.

Insert Fig. 8 about here

Insert Fig. 9 about here
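
The gain computation described above (training score minus pre-test score) can be sketched as follows. The student scores are hypothetical, and the use of the sample standard deviation is an assumption, since the text does not say which form was plotted in Fig. 9.

```python
from statistics import mean, stdev

def gains(pretest, training):
    """Per-student gain: training score minus pre-test score,
    both expressed as percent correct."""
    return [t - p for p, t in zip(pretest, training)]

def summarize_gains(pretest, training):
    """Mean gain and sample standard deviation for one course part."""
    g = gains(pretest, training)
    return mean(g), stdev(g)

# Hypothetical percent-correct scores for four students on one part
# of the course, before and after that part was taught.
pretest = [20, 30, 25, 25]
training = [80, 85, 90, 75]

print(gains(pretest, training))  # [60, 55, 65, 50]
mean_gain, sd_gain = summarize_gains(pretest, training)
print(mean_gain)                 # 57.5
```

Keeping the per-student list, rather than only the group mean, preserves the ability to make statements about individual students, which is the main reason for using these designs.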

Unfortunately, we have had to describe these experiments hastily.
I would encourage those of you who are interested in using single
subject analyses of the type described here to refer to the articles
referenced for more detail. Let me reiterate my belief that single
subject analysis has a great deal to offer researchers in higher
education. First, both reversal and multiple baseline designs involve
applying two or more treatments to the same individual or group over
time, thus enabling a precise analysis of individual performance and/or
preference. Second, both allow the researcher to make statements about
functional relationships between experimental treatments and observed
changes in behavior. Finally, both have been demonstrated by the
research results presented here to be viable tools for college teaching
research.

Figure Captions

Fig. 1 - In the baseline condition the behavior of interest is first
measured repeatedly over time. Then under the experimental
condition the new level of behavior is measured repeatedly and
compared with the level that would have been forecast from the
baseline measure. (From Risley and Wolf, 1972)

Fig. 2 - The same mean effects presented in Fig. 1 now plotted in a
manner such that the trends in the behavior under each condition
are apparent. (From Risley and Wolf, 1972)

Fig. 3 - Mean percent of quiz items answered correctly for study and
non-study question items during baseline and noncontingent points.
(From Semb, Hopkins and Hursh, 1973)

Fig. 4 - An example of a counterbalanced reversal design. (From
Bostow and Blumenfeld, 1972)

Fig. 5 - The mean number of lessons completed by each student during
each day of the semester. The dashed lines represent mean lessons
completed per day for each of the three experimental conditions.
(From Miller, Weaver and Semb, 1973)

Fig. 6 - Mean percent daily attendance at optional discussion sections
for the 12:00 and 1:30 lecture groups. The two experimental
conditions included baseline and quiz return. Arrows indicate
days on which hour exams were given in lecture. (From Semb, 1972)

Fig. 7 - Average scores on each subsection of the achievement test.
Dotted line indicates introduction of teaching package for that
subsection. (From Miller and Weaver, 1972)

Fig. 8 - Mean percent correct on study question items (closed circles)
and probe items (open circles) on each of the five Achievement
Tests. Group 1 is plotted on the left and Group 2 on the right.
The four course parts are plotted vertically on separate coordinates
and the five Achievement Tests are shown on the abscissa.
(From Semb, 1973)

Fig. 9 - Mean percentage gains over pretest levels for each of the
four parts of the course. Study items are represented by closed
circles and probe items by open circles. Vertical lines through
each point indicate standard deviations. (From Semb, 1973)